Execute: Cmd+Shift+Enter. New Chunk: Cmd+Option+I. Preview: Cmd+Shift+K Notes: The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.
library(leaps)
library(modelr)
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-26. For overview type 'help("mgcv-package")'.
library(ggplot2)
library(DataExplorer)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:nlme':
##
## collapse
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(ggExtra)
library(caret)
## Loading required package: lattice
library(reshape2)
library(RColorBrewer)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(corrplot)
## corrplot 0.84 loaded
Due to the large number of libraries in use I have provided system information.
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS/LAPACK: /anaconda3/lib/R/lib/libRblas.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] corrplot_0.84 plotly_4.8.0 RColorBrewer_1.1-2
## [4] reshape2_1.4.3 caret_6.0-81 lattice_0.20-38
## [7] ggExtra_0.8 GGally_1.4.0 dplyr_0.8.0.1
## [10] DataExplorer_0.7.0 ggplot2_3.1.0 mgcv_1.8-26
## [13] nlme_3.1-137 modelr_0.1.3 leaps_3.0
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.0 tidyr_0.8.2 viridisLite_0.3.0
## [4] jsonlite_1.6 splines_3.5.1 foreach_1.4.4
## [7] prodlim_2018.04.18 shiny_1.2.0 assertthat_0.2.0
## [10] stats4_3.5.1 yaml_2.2.0 ipred_0.9-8
## [13] pillar_1.3.1 backports_1.1.3 glue_1.3.0
## [16] digest_0.6.18 promises_1.0.1 colorspace_1.4-0
## [19] recipes_0.1.4 htmltools_0.3.6 httpuv_1.4.5.1
## [22] Matrix_1.2-15 plyr_1.8.4 timeDate_3043.102
## [25] pkgconfig_2.0.2 broom_0.5.1 purrr_0.2.5
## [28] xtable_1.8-3 scales_1.0.0 later_0.8.0
## [31] gower_0.1.2 lava_1.6.5 tibble_2.0.1
## [34] generics_0.0.2 withr_2.1.2 nnet_7.3-12
## [37] lazyeval_0.2.1 survival_2.43-3 magrittr_1.5
## [40] crayon_1.3.4 mime_0.6 evaluate_0.12
## [43] MASS_7.3-51.1 class_7.3-15 tools_3.5.1
## [46] data.table_1.12.0 stringr_1.4.0 munsell_0.5.0
## [49] networkD3_0.4 compiler_3.5.1 rlang_0.3.1
## [52] grid_3.5.1 iterators_1.0.10 htmlwidgets_1.3
## [55] igraph_1.2.4 miniUI_0.1.1.1 rmarkdown_1.11
## [58] gtable_0.2.0 ModelMetrics_1.1.0 codetools_0.2-16
## [61] reshape_0.8.8 R6_2.4.0 gridExtra_2.3
## [64] lubridate_1.7.4 knitr_1.21 stringi_1.3.1
## [67] parallel_3.5.1 Rcpp_1.0.0 rpart_4.1-13
## [70] tidyselect_0.2.5 xfun_0.4
sapply(c('repr', 'IRdisplay', 'IRkernel'), function(p) paste(packageVersion(p)))
## repr IRdisplay IRkernel
## "0.19.2" "0.7.0" "0.8.15"
First we inspect the raw records using bash:
head -n 5 data/Carseats_org.csv
## "","Sales","CompPrice","Income","Advertising","Population","Price","ShelveLoc","Age","Education","Urban","US"
## "1",9.5,138,73,11,276,120,"Bad",42,17,"Yes","Yes"
## "2",11.22,111,48,16,260,83,"Good",65,10,"Yes","Yes"
## "3",10.06,113,35,10,269,80,"Medium",59,12,"Yes","Yes"
## "4",7.4,117,100,4,466,97,"Medium",55,14,"Yes","Yes"
Now I load the data into r, and drop the “ID” column.
carseats <- read.csv("data/Carseats_org.csv", header = T, stringsAsFactors = T)
drops <- c("X")
carseats <- carseats[ , !(names(carseats) %in% drops)]
head(carseats, 10)
Here I create two new dataframes to manage numeric and categorical data.
# get vectors of continuous and categorical cols
nums <- dplyr::select_if(carseats, is.numeric)
cats <- dplyr::select_if(carseats, is.factor)
nums[sample(nrow(nums), 10), ]
cats[sample(nrow(cats), 10), ]
Let’s get quick summaries of each:
summary(nums)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price Age Education
## Min. : 10.0 Min. : 24.0 Min. :25.00 Min. :10.0
## 1st Qu.:139.0 1st Qu.:100.0 1st Qu.:39.75 1st Qu.:12.0
## Median :272.0 Median :117.0 Median :54.50 Median :14.0
## Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
summary(cats)
## ShelveLoc Urban US
## Bad : 96 No :118 No :142
## Good : 85 Yes:282 Yes:258
## Medium:219
str(nums)
## 'data.frame': 400 obs. of 8 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : int 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : int 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: int 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : int 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : int 120 83 80 97 128 72 108 120 124 124 ...
## $ Age : int 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : int 17 10 12 14 13 16 15 10 10 17 ...
str(cats)
## 'data.frame': 400 obs. of 3 variables:
## $ ShelveLoc: Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
This command is to inspect the different data types in the data.
str(carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : int 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : int 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: int 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : int 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : int 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : int 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : int 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
print("")
## [1] ""
print(paste('Number of Columns:', ncol(carseats)))
## [1] "Number of Columns: 11"
print(paste('Number of Numeric Columns:', ncol(nums)))
## [1] "Number of Numeric Columns: 8"
print(paste('Number of Categorical Columns:', ncol(cats)))
## [1] "Number of Categorical Columns: 3"
Here’s a quick way to examine general properties of the data:
DataExplorer::introduce(data=carseats)
I start out with a few general scatter plots.
plot_ly(data=carseats,
x=~Age,
y=~Sales,
mode = 'markers',
type = 'scatter',
color=~ShelveLoc) %>%
layout(title = "Age, Shelf Location, and Sales Scatter Plot", width=900)
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
This plot below shows good separation and a weak linear trend. These variables are worth investigating.
plot_ly(data=carseats,
x=~Price,
y=~Sales,
mode = 'markers',
type = 'scatter',
color=~ShelveLoc) %>%
layout(title = "Price, Shelf Location, and Sales Scatter Plot", width=900)
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
Here we inspect the density of the ‘Price vs Sales’ relationship:
plot_ly(data=carseats,
x=~Price,
y=~Sales,
mode = 'markers',
size = ~Price,
type = 'scatter',
colors = "Dark2",
alpha = .6) %>%
layout(title = "Price, US, and Sales Scatter Plot", width=900)
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
## Warning: `line.width` does not currently support multiple values.
I choose to normalize the numeric data in order to be able to plot each variable on the same scale. This will allow me to investigate the variation of each predictor relaticve to Sales.
preObj <- preProcess(nums, method=c("center", "scale"))
scaled.nums <- predict(preObj, nums)
head(scaled.nums)
str(scaled.nums)
## 'data.frame': 400 obs. of 8 variables:
## $ Sales : num 0.7095 1.3185 0.9078 -0.0341 -1.1849 ...
## $ CompPrice : num 0.849 -0.911 -0.781 -0.52 1.045 ...
## $ Income : num 0.155 -0.738 -1.203 1.12 -0.166 ...
## $ Advertising: num 0.656 1.408 0.506 -0.396 -0.547 ...
## $ Population : num 0.0757 -0.0328 0.0282 1.3649 0.51 ...
## $ Price : num 0.178 -1.385 -1.512 -0.794 0.515 ...
## $ Age : num -0.699 0.721 0.35 0.104 -0.946 ...
## $ Education : num 1.183 -1.4882 -0.725 0.0382 -0.3434 ...
print("")
## [1] ""
summary(scaled.nums)
## Sales CompPrice Income
## Min. :-2.65440 Min. :-3.12856 Min. :-1.70290
## 1st Qu.:-0.74584 1st Qu.:-0.65049 1st Qu.:-0.92573
## Median :-0.00224 Median : 0.00163 Median : 0.01224
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.64575 3rd Qu.: 0.65375 3rd Qu.: 0.79834
## Max. : 3.10670 Max. : 3.26225 Max. : 1.83458
## Advertising Population Price
## Min. :-0.9977 Min. :-1.72918 Min. :-3.87702
## 1st Qu.:-0.9977 1st Qu.:-0.85387 1st Qu.:-0.66711
## Median :-0.2459 Median : 0.04858 Median : 0.05089
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.8067 3rd Qu.: 0.90693 3rd Qu.: 0.64219
## Max. : 3.3630 Max. : 1.65671 Max. : 3.17633
## Age Education
## Min. :-1.74827 Min. :-1.48825
## 1st Qu.:-0.83779 1st Qu.:-0.72504
## Median : 0.07268 Median : 0.03816
## Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.78255 3rd Qu.: 0.80137
## Max. : 1.64673 Max. : 1.56457
Here are scaled distributions (histograms and density plots) for each numeric variable, including Sales. The variables relating to money ($) tend to be approximately normal. Many other variables tend to be approximately uniform, which does not bode well for their predictive power.
scaled.nums %>%
tidyr::gather() %>%
ggplot(aes(x=value,y=..density..))+
ggtitle('Distributions of Continous Variables (scaled)') +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill=I("orange"), col=I("black")) +
facet_wrap(~ key, scales = "free") +
geom_density(color="blue", fill='light blue', alpha = 0.4)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Here we plot all numeric variables against their distributions. This is just another way to examine the information shown above.
scaled.nums %>%
tidyr::gather() %>%
plot_ly(x=~key, y=~value,
type = "box",
boxpoints = "all",
jitter = 0.4,
pointpos = 0,
color = ~key,
colors = "Dark2") %>%
subplot(shareX = TRUE) %>%
layout(title = "Numeric Variable Distributions (scaled)",
yaxis=list(title='Standard Deviation'),
xaxis=list(title='Variable'),
autosize=FALSE,
width=900,
height=900)
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
Here we plot all numeric variables against Sales (scaled). This allows us to investigate possible linear relationships between that variable and Sales. As shown below, only ‘Price’ appears to have a linear relationshop worth investigating.
scaled.nums %>%
tidyr::gather(-Sales, key = "var", value = "value") %>%
split(.$var) %>%
lapply(function(d) plot_ly(d, x=~value, y=~Sales,
mode = 'markers',
type = 'scatter',
colors = 'Set3',
size = ~Sales^2,
marker = list(line = list(
color = 'black',
width = 2)),
name=~var)) %>%
subplot(nrows=7, shareY=TRUE) %>%
layout(title = "Scatterplot Numeric vs Sales (scaled)<br>Size=Sales^2",
autosize=FALSE,
width=900,
height=1500,
yaxis=list(title='Sales'))
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
By adding naive regression lines to a few scatterplots we can confirm our suspicions:
fit.Pop <- lm(Sales ~ Population, data = scaled.nums)
fit.Age <- lm(Sales ~ Age, data = scaled.nums)
fit.CompPrice <- lm(Sales ~ CompPrice, data = scaled.nums)
fit.Price <- lm(Sales ~ Price, data = scaled.nums)
p1 <- plot_ly(scaled.nums,
x = ~Population,
name = 'Population vs Sales Regression Line') %>%
add_markers(y = ~Sales,
name = 'Population vs Sales Observations') %>%
add_lines(x = ~Population,
y = fitted(fit.Pop))
p2 <- plot_ly(scaled.nums,
x = ~Age,
name = 'Age vs Sales Regression Line') %>%
add_markers(y = ~Sales,
name = 'Age vs Sales Observations') %>%
add_lines(x = ~Age,
y = fitted(fit.Age))
p3 <- plot_ly(scaled.nums,
x = ~CompPrice,
name = 'CompPrice vs Sales Regression Line') %>%
add_markers(y = ~Sales,
name = 'CompPrice vs Sales Observations') %>%
add_lines(x = ~CompPrice,
y = fitted(fit.CompPrice))
p4 <- plot_ly(scaled.nums,
x = ~Price,
name = 'Price vs Sales Regression Line') %>%
add_markers(y = ~Sales,
name = 'Price vs Sales Observations') %>%
add_lines(x = ~Price,
y = fitted(fit.Price))
subplot(p1, p2, p3, p4, nrows=2, shareY=TRUE) %>%
layout(title = "Scatterplot Numeric vs Sales (scaled) <br> With Regression Lines",
autosize=FALSE,
width=1000,
height=700,
yaxis=list(title='Sales'))
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
Let’s compare the slopes:
scaled.nums %>%
plot_ly(y = ~Sales) %>%
add_lines(x= ~Population, y = fitted(fit.Pop),
name = "fit.Pop slope", line = list(shape = "linear")) %>%
add_lines(x= ~Age, y = fitted(fit.Age),
name = "fit.Age slope", line = list(shape = "linear")) %>%
add_lines(x= ~CompPrice, y = fitted(fit.CompPrice),
name = "fit.CompPrice slope", line = list(shape = "linear")) %>%
add_lines(x= ~Price, y = fitted(fit.Price),
name = "fit.Price slope", line = list(shape = "linear")) %>%
layout(title = "Regression Lines vs Sales (scaled)",
autosize=FALSE,
width=700,
height=500,
yaxis=list(title='Sales'),
xaxis=list(title='Scaled Numeric Variable'))
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
Here’s a pretty graphic that doesn’t help me understand anything about the data.
y = scaled.nums$Sales
x = scaled.nums$Price
s <- subplot(
plot_ly(x = x, color = I("black"), type = 'histogram'),
plotly_empty(),
plot_ly(x = x, y = y, type = 'histogram2dcontour', showscale = F),
plot_ly(y = y, color = I("black"), type = 'histogram'),
nrows = 2, heights = c(0.2, 0.8), widths = c(0.8, 0.2),
shareX = TRUE, shareY = TRUE, titleX = FALSE, titleY = FALSE)
## Warning: No trace type specified and no positional attributes specified
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
layout(s, showlegend = FALSE, autosize=FALSE,
width=700,
height=500,
yaxis=list(title='Sales'),
xaxis=list(title='Price'))
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
First, let’s create a dataframe that we can use:
categorical.by.sales = cbind(Sales = scaled.nums$Sales, cats)
Let’s try ’em all at once. This works, but I don’t manage the colors very well.
colr = brewer.pal(9, "Set1")
categorical.by.sales %>%
tidyr::gather(-Sales, key = "var", value = "value") %>%
split(.$var) %>%
lapply(function(d) plot_ly(d, x=~paste(var, '<br>' , value), y=~Sales,
type = "box",
boxpoints = "all",
jitter = .2,
pointpos = 0,
color =~paste(var, ":", value),
colors = sample(colr, length(unique(d$value))))) %>%
# colors = "Set1")) %>%
subplot(shareY = TRUE, nrows=3) %>%
layout(title = "Categorical Variable Distributions vs Sales (scaled)",
yaxis=list(title='Sales Standard Deviation'),
xaxis=list(title=''),
autosize=FALSE,
width=900,
height=1500)
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
I run these each separately and get much nicer plots.
categorical.by.sales %>%
plot_ly(x = ~US, y = ~Sales,
split = ~US,
type = 'violin',
box = list(visible = TRUE),
meanline = list(visible = TRUE)) %>%
layout(xaxis = list(title = "US"),
yaxis = list(title = "Sales",zeroline = FALSE))
categorical.by.sales %>%
plot_ly(x = ~Urban, y = ~Sales,
split = ~Urban,
type = 'violin',
box = list(visible = TRUE),
meanline = list(visible = TRUE)) %>%
layout(xaxis = list(title = "Urban"),
yaxis = list(title = "Sales",zeroline = FALSE))
Now this is interesting. We suspected that ‘ShelveLoc’ would be important based on one of the early scatterplots. It seems that this is the case.
categorical.by.sales %>%
plot_ly(x = ~ShelveLoc, y = ~Sales,
split = ~ShelveLoc,
type = 'violin',
box = list(visible = TRUE),
meanline = list(visible = TRUE)) %>%
layout(xaxis = list(title = "ShelveLoc"),
yaxis = list(title = "Sales",zeroline = FALSE))
First, let’s merge the data set into a single dataframe
scaled.merged <- merge(scaled.nums, categorical.by.sales, by="Sales")
head(scaled.merged)
First, let’s look at some things that may give us trouble. Luckily it looks like the only serious correlation is with our dependent variable.
res <- cor(scaled.nums)
corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
It appears that residuals are roughly symmetrical around 0. That’s strange. Mostly due to a relatively poor overall fit. Note how close to zero most of the coefficient estimates are.
simple.lm <- lm(Sales~., data=scaled.merged)
summary(simple.lm)
##
## Call:
## lm(formula = Sales ~ ., data = scaled.merged)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.32467 -0.29881 -0.01677 0.27320 1.82851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.70000 0.06018 -11.633 < 2e-16 ***
## CompPrice 0.44976 0.02515 17.883 < 2e-16 ***
## Income 0.11240 0.02029 5.539 4.78e-08 ***
## Advertising 0.24069 0.02389 10.076 < 2e-16 ***
## Population 0.01656 0.02134 0.776 0.438
## Price -0.70235 0.02530 -27.759 < 2e-16 ***
## Age -0.23287 0.02012 -11.576 < 2e-16 ***
## Education -0.02619 0.02029 -1.291 0.197
## ShelveLocGood 1.39901 0.06043 23.152 < 2e-16 ***
## ShelveLocMedium 0.59700 0.04974 12.003 < 2e-16 ***
## UrbanYes 0.03785 0.04353 0.870 0.385
## USYes 0.06868 0.04773 1.439 0.151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4672 on 538 degrees of freedom
## Multiple R-squared: 0.7543, Adjusted R-squared: 0.7492
## F-statistic: 150.1 on 11 and 538 DF, p-value: < 2.2e-16
Here we define a new model with some interaction terms: a. Income and Advertising b. Income and CompPrice c. Price and Age
interaction.lm <- lm(Sales~. + Income*Advertising + Income*CompPrice + Price*Age, data=scaled.merged)
summary(interaction.lm)
##
## Call:
## lm(formula = Sales ~ . + Income * Advertising + Income * CompPrice +
## Price * Age, data = scaled.merged)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.31721 -0.30723 -0.01874 0.28583 1.91559
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.69732 0.05993 -11.635 < 2e-16 ***
## CompPrice 0.44981 0.02500 17.991 < 2e-16 ***
## Income 0.10889 0.02028 5.369 1.18e-07 ***
## Advertising 0.23790 0.02377 10.008 < 2e-16 ***
## Population 0.01367 0.02121 0.645 0.5194
## Price -0.70114 0.02513 -27.903 < 2e-16 ***
## Age -0.23142 0.02003 -11.553 < 2e-16 ***
## Education -0.03038 0.02024 -1.501 0.1339
## ShelveLocGood 1.39095 0.06040 23.028 < 2e-16 ***
## ShelveLocMedium 0.58444 0.04973 11.752 < 2e-16 ***
## UrbanYes 0.03659 0.04323 0.846 0.3977
## USYes 0.07076 0.04740 1.493 0.1360
## Income:Advertising 0.03748 0.01999 1.875 0.0613 .
## CompPrice:Income -0.05181 0.02053 -2.524 0.0119 *
## Price:Age 0.01399 0.01991 0.703 0.4826
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4637 on 535 degrees of freedom
## Multiple R-squared: 0.7593, Adjusted R-squared: 0.753
## F-statistic: 120.5 on 14 and 535 DF, p-value: < 2.2e-16
Which variables ( and interaction pairs ) are significant ? Do you find anything surprising or counterintuitive in the results? Describe any response variable or lack of response that seems surprising.
Repeat the above but using the “leaps” package to perform subset selection, using the full model using ONLY the following variables:
Price Population CompPrice Income Education Age Advertising ShelveLoc
Plus the following interaction term: Income and Age Income and Advertising Income and CompPrice
Use the regsubsets command in “leaps” to do the full model ( do NOT specify really.big=T as it will require a long time to complete, along with forward and backward subset selection. Use the R file, Subset_select.R as a guide for syntax.
The result of the regressions is a model object, so you can query the proper coefficient of the model to determine which model, of the multiple models fit, produces the best metric.
Create a table, or simply list, for the full model, and forward and backward selection, which model # is the best, then write out the model, that is put the model in the form y=intercept + coefficient*variable, for the best model using the BIC measure. NOTE: these may be the same model, or may differ slightly.
Info on the Carseats data follows:
Carseats data info: